feat: add parallel chunking example #525
Draft
This PR adds an experimental `parallel-chunk` example binary that is feature gated with a `parallel-chunking` flag.

The changes include:
- adds `chunk_file_parallel` to `Chunker`
- adds `parallel_chunking.rs`

Run
build the examples
chunk a file
the output should be identical to the original chunk binary output
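One way to check that the two outputs really are identical is a small line-by-line comparison. The sketch below is illustrative only and is not part of this PR: `first_mismatch` is a hypothetical helper, and it assumes both binaries' outputs have already been captured as strings.

```rust
/// Hypothetical helper (not part of this PR): returns the 1-based number of
/// the first line where the two listings differ, or `None` when they are
/// identical (same lines in the same order, same line count).
pub fn first_mismatch(reference: &str, candidate: &str) -> Option<usize> {
    let mut ref_lines = reference.lines();
    let mut cand_lines = candidate.lines();
    let mut line_no = 0;
    loop {
        line_no += 1;
        match (ref_lines.next(), cand_lines.next()) {
            // Both listings ended at the same point: no difference found.
            (None, None) => return None,
            // Same content on this line: keep scanning.
            (a, b) if a == b => continue,
            // Different content, or one listing ended early.
            _ => return Some(line_no),
        }
    }
}
```

A `None` result means the outputs match; `Some(n)` points at the first line worth inspecting, which also covers the case where one listing is shorter than the other.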
Benching
Below is a small benchmark script that compares `chunk` and `parallel-chunk` runtimes on a generated 1 GB input file using hyperfine.

    ./bench.sh 1 # the number indicates the random file gb size

    Building examples...
       Compiling deduplication v0.14.5 (/Users/drbh/Projects/xet-core-tmp/deduplication)
       Compiling cas_object v0.1.0 (/Users/drbh/Projects/xet-core-tmp/cas_object)
       Compiling cas_client v0.14.5 (/Users/drbh/Projects/xet-core-tmp/cas_client)
       Compiling hub_client v0.1.0 (/Users/drbh/Projects/xet-core-tmp/hub_client)
       Compiling data v0.14.5 (/Users/drbh/Projects/xet-core-tmp/data)
        Finished `release` profile [optimized + debuginfo] target(s) in 10.88s
       Compiling data v0.14.5 (/Users/drbh/Projects/xet-core-tmp/data)
        Finished `release` profile [optimized + debuginfo] target(s) in 5.47s
    Reference version...
    (last 10 lines)
    fd27a0d46a8d835b52ab6bba351b8f977ac6a31ea26f2e56f558a4def0cf41c0 40657
    290e4a68705cbb8ee95274f027bd17356438c4ab94c41b58ee72d22c0d6afb88 93564
    5ae15e4077c349647cdc0f9f2272b91fdf6ab50f6ef67763b3c165ec925385e5 55783
    8a5e75b8c173d1b466a2a66ba54853c317f8913a544c5d5ab547bff1dec5960f 59859
    0b899aad6c918cd5a709c5bd5994c036f37509e8345ac503c2d059e766e06924 131072
    4d0c6e8d974da6f53f959a55fbc735a484c09f6ce3dde5410ef505a590d3bbef 111800
    02dc025bdb7afd7e0b32c545f1abd70015c728508c3aa6ec3a7cf08a6c52a250 89915
    a602df3c6e333fdbe6434158945aa05ec82139bed8741336b128322423005112 29275
    234ce19a584bb9996f3a909adada4dff8de8b30a296b8cbb4b31099af09501b3 36264
    5090799ec0d6b1dba9a69048e6af617371a1305d98bb12c83659e86dcad1dce0 59004
    Parallel version...
    (last 10 lines)
    fd27a0d46a8d835b52ab6bba351b8f977ac6a31ea26f2e56f558a4def0cf41c0 40657
    290e4a68705cbb8ee95274f027bd17356438c4ab94c41b58ee72d22c0d6afb88 93564
    5ae15e4077c349647cdc0f9f2272b91fdf6ab50f6ef67763b3c165ec925385e5 55783
    8a5e75b8c173d1b466a2a66ba54853c317f8913a544c5d5ab547bff1dec5960f 59859
    0b899aad6c918cd5a709c5bd5994c036f37509e8345ac503c2d059e766e06924 131072
    4d0c6e8d974da6f53f959a55fbc735a484c09f6ce3dde5410ef505a590d3bbef 111800
    02dc025bdb7afd7e0b32c545f1abd70015c728508c3aa6ec3a7cf08a6c52a250 89915
    a602df3c6e333fdbe6434158945aa05ec82139bed8741336b128322423005112 29275
    234ce19a584bb9996f3a909adada4dff8de8b30a296b8cbb4b31099af09501b3 36264
    5090799ec0d6b1dba9a69048e6af617371a1305d98bb12c83659e86dcad1dce0 59004
    Running reference benchmarks...
    Benchmark 1: target/release/examples/chunk --input /tmp/random_1.0gb.bin
      Time (mean ± σ):     1.169 s ± 0.017 s    [User: 1.015 s, System: 0.151 s]
      Range (min … max):   1.149 s … 1.205 s    10 runs
    Running parallel benchmarks...
    Benchmark 1: target/release/examples/parallel-chunk --input /tmp/random_1.0gb.bin
      Time (mean ± σ):     217.4 ms ± 10.4 ms    [User: 1317.2 ms, System: 264.6 ms]
      Range (min … max):   201.7 ms … 234.3 ms    14 runs

Run on a MacBook M3 Max.
As shown in the benchmarks above, the user time (~compute time) is comparable in both cases (1.015 s vs. 1.317 s), but the wall-clock time is more than 5x faster (1.169 s vs. 217.4 ms) when the work can be distributed across multiple threads.
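The arithmetic behind that comparison can be reproduced from the hyperfine means reported above. This is only a worked calculation; the function and constant names are made up for illustration.

```rust
// Mean timings copied from the hyperfine output above, in seconds.
pub const REFERENCE_WALL_S: f64 = 1.169;
pub const PARALLEL_WALL_S: f64 = 0.2174;
pub const REFERENCE_USER_S: f64 = 1.015;
pub const PARALLEL_USER_S: f64 = 1.3172;

/// Ratio of a baseline duration to a candidate duration (same unit for both).
/// Values above 1.0 mean the candidate took less time than the baseline.
pub fn speedup(baseline_s: f64, candidate_s: f64) -> f64 {
    baseline_s / candidate_s
}

/// Wall-clock speedup of the parallel run over the reference run (about 5.4).
pub fn wall_clock_speedup() -> f64 {
    speedup(REFERENCE_WALL_S, PARALLEL_WALL_S)
}

/// How much more user CPU time the parallel run spent (about 1.3x).
pub fn user_time_ratio() -> f64 {
    PARALLEL_USER_S / REFERENCE_USER_S
}
```

With these numbers the wall-clock speedup comes out at roughly 5.4x, while the parallel run spends about 1.3x the user CPU time of the reference run.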
PS: the chunk boundaries and hash values appear to match the reference exactly in all cases when testing locally. However, more tests to ensure correctness would increase confidence.
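As one example of an extra check, the chunk sizes in a listing should add up to the length of the input file regardless of how the work was split across threads. The sketch below assumes the output has one `<hash> <size>` pair per line, as the listings above suggest; `total_chunked_bytes` and `covers_whole_file` are hypothetical names and not part of this PR.

```rust
/// Hypothetical helper: parses a listing with one `<hash> <size>` pair per
/// line (an assumption about the output format) and returns the total number
/// of bytes covered. Returns `None` if any non-blank line is malformed or
/// the sum overflows.
pub fn total_chunked_bytes(listing: &str) -> Option<u64> {
    let mut total = 0u64;
    for line in listing.lines().filter(|l| !l.trim().is_empty()) {
        let mut fields = line.split_whitespace();
        let _hash = fields.next()?;
        let size: u64 = fields.next()?.parse().ok()?;
        // Reject lines with unexpected trailing fields.
        if fields.next().is_some() {
            return None;
        }
        total = total.checked_add(size)?;
    }
    Some(total)
}

/// True when the listing parses cleanly and its chunk sizes sum to `file_len`.
pub fn covers_whole_file(listing: &str, file_len: u64) -> bool {
    total_chunked_bytes(listing) == Some(file_len)
}
```

This does not prove the boundaries are right, but it cheaply catches dropped or duplicated chunks at thread hand-off points, which a hash-by-hash comparison against the reference would then pinpoint.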
Opening this PR as a draft since it adds a new file, dependencies and a feature flag which may need to be refactored/changed. Looking forward to feedback and please let me know what needs to be updated/changed to complete this PR.